Abstract

Background: The analysis of massive high throughput data via clustering algorithms is very important for 
elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. 
Biclustering overcomes these limitations by grouping genes and samples simultaneously. It discovers subsets of 
genes that are co-expressed in certain samples. Recent studies showed that biclustering has a great potential in 
detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have 
been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation 
measures. Besides that, there is a need for a biclustering algorithm that is capable of analyzing very large datasets 
in reasonable time. 

Results: Here we present a fast biclustering algorithm called DeBi (Differentially Expressed Biclusters). The 
algorithm is based on a well-known data mining approach called frequent itemset mining. It discovers maximum size
homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the 
performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets. 

Conclusions: We demonstrate that the DeBi algorithm provides functionally more coherent gene sets compared 
to standard clustering or biclustering algorithms using biological validation measures such as Gene Ontology term 
and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful 
tool for analyzing large datasets. The method is also applicable to multiple gene expression datasets coming from
different labs or platforms. 



Background 

In recent years, various high throughput technologies 
such as cDNA microarrays, oligo-microarrays and 
sequence-based approaches (RNA-Seq) for transcrip- 
tome profiling have been developed. The most common 
approach for detecting functionally related gene sets 
from such high throughput data is clustering [1]. Tradi- 
tional clustering methods like hierarchical clustering [2] 
and k-means [3], have several limitations. Firstly, they 
are based on the assumption that a cluster of genes 
behaves similarly in all samples. However, a cellular pro- 
cess may affect a subset of genes, only under certain 
conditions. Secondly, clustering assigns each gene or 
sample to a single cluster. However, some genes may 
not be active in any of the samples and some genes may 
participate in multiple processes. 

Biclustering is a two-way clustering method for detecting
local patterns in data. It finds subsets of genes that
behave similarly in subsets of samples. Biclustering was 
initially introduced by Hartigan [4]. However, it was first 
applied by Cheng and Church [5] on gene expression 
data. Cheng and Church tried to identify submatrices of 
low mean residue score which indicates uniform fluctua- 
tion in expression profiles. Since the algorithm discovers 
one bicluster at a time, repeated application of the 
method on a modified matrix is needed for discovering 
multiple biclusters. This has the drawback that it results 
in highly overlapping gene sets. Ben-Dor et al. [6] 
detected a subset of genes whose expression levels 
induce the same linear ordering of the experiments. The 
drawback of this method is that it enforces a strict 
order of the samples. Bergmann et al. [7] identified 
biclusters which consist of the set of co-regulated genes 
and the conditions that induce their co-regulation. Murali
and Kasif [8] found subsets of genes that are
simultaneously similarly expressed across a subset of the
samples. The algorithm uses prior knowledge about the
sample phenotypes. Tanay et al. [9] defined biclustering
as a problem of finding bicliques in a bipartite graph. 
Due to its high complexity, the number of rows the 
bicluster may have is restricted. Prelic et al. [10] defined 
the binary inclusion maximal biclustering (BIMAX) 
using a fast divide and conquer method. However, 
divide and conquer has the drawback of possibly miss- 
ing good biclusters by early splits. Li et al. [11] devel- 
oped an algorithm for discovering statistically significant 
biclusters from datasets containing tens of thousands of 
genes and thousands of conditions. Madeira and Oli- 
veira have written a detailed review on different biclus- 
tering methods [12]. 

Here, we propose a novel, fast biclustering algorithm 
called DeBi that utilizes differential gene expression ana- 
lysis. In DeBi, a bicluster has the following two main 
properties. Firstly, a bicluster is a maximum homogeneous
gene set where each gene in the bicluster should
be highly or lowly expressed over all the bicluster sam- 
ples. Secondly, each gene in the bicluster shows statisti- 
cal difference in expression between the samples in the 
bicluster and the samples not in the bicluster. Differen- 
tially expressed biclusters lead to functionally more 
coherent gene sets compared to standard clustering or 
biclustering algorithms. 

There are several advantages of the DeBi algorithm. 
Firstly, the algorithm is capable of discovering biclusters 
on very large datasets such as the human connectivity 
map data with 22283 genes and 6100 samples in reasonable
time. Secondly, unlike several existing methods [5,7,10], it does
not require the number of biclusters to be defined a priori.

We evaluated the performance of DeBi on a yeast
dataset [13], on synthetic datasets [10], on the connectivity
map dataset, which is a reference collection of
gene expression profiles from human cells that have
been treated with a variety of drugs [14], on gene expression
profiles of 2158 human tumor samples published
by expO (Expression Project for Oncology), on a diffuse
large B-cell lymphoma (DLBCL) dataset [15] and on
gene sets from the Molecular Signature Database
(MSigDB) C2 category. We show that DeBi compares
well with existing biclustering methods such as BIMAX, 
SAMBA, Cheng and Church's algorithm (CC), Order 
Preserving Submatrix Algorithm (OPSM), Iterative Sig- 
nature Algorithm (ISA) and Qualitative Biclustering 
(QUBIC) [5-7,9,10]. 

Results 

We have evaluated our algorithm on six datasets: (a)
Prelic's benchmark synthetic datasets with implanted
biclusters [10]; (b) 300 different experimental perturbations
of S. cerevisiae [13]; (c) a diffuse large B-cell
lymphoma (DLBCL) dataset [15]; (d) a reference collection
of gene-expression profiles from human cells that
have been treated with a variety of drugs [14]; (e) gene
expression profiles of 2158 human tumor samples published
by expO (Expression Project for Oncology); and (f)
gene sets from the Molecular Signature Database
(MSigDB) C2 category. The synthetic data is studied to
show the performance of our algorithm in recovering 
implanted biclusters. Additionally, the effect of overlap 
between biclusters and noise on the performance of the 
algorithm can be studied using the synthetic data. The 
yeast and human gene expression datasets are studied to 
evaluate the biological relevance of the biclusters from 
several aspects. Unless specified otherwise, we used a fold-change
of 2 for binarizing the datasets, and the set of biclusters generated
by each algorithm was filtered such that the remaining
ones have a maximum overlap of 0.5.

First, for each bicluster we calculated the statistically 
significantly enriched Gene Ontology (GO) terms using 
the hypergeometric test. We determined the proportion 
of GO term enriched biclusters at different levels of sig- 
nificance. Second, Transcription Factor Binding Sites 
(TFBS) enrichment is calculated by a hypergeometric 
test using transcription factor binding site data coming 
from various sources [16-18] at different levels of signifi- 
cance. The GO term and TFBS enrichment analyses are 
done using Genomica http://genie.weizmann.ac.il. 
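
The enrichment test described above can be illustrated with a minimal sketch. It computes the upper-tail hypergeometric p-value for observing at least a given number of annotated genes in a bicluster. The function and parameter names are hypothetical, and the sketch is not the Genomica implementation; it only shows the form of the test.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, cluster_size, hits):
    """Upper-tail probability P(X >= hits) of the hypergeometric distribution.

    population   -- number of genes in the background (assumed known)
    annotated    -- background genes carrying the GO term or TFBS
    cluster_size -- number of genes in the bicluster
    hits         -- bicluster genes carrying the annotation
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

if __name__ == "__main__":
    # Toy numbers only: 3 of 3 bicluster genes annotated, 3 of 10 in background.
    print(hypergeom_enrichment_pvalue(10, 3, 3, 3))
```

A bicluster would then be counted as enriched at a given significance level when this p-value falls below that level; any multiple-testing correction is left out of the sketch.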

We have compared our algorithm with BIMAX, 
SAMBA, Cheng and Church's algorithm (CC), Order
Preserving Submatrix Algorithm (OPSM), Iterative Sig- 
nature Algorithm (ISA) and Qualitative Biclustering 
(QUBIC) [5-7,9,10]. We used QUBIC software for 
QUBIC, BicAT software for OPSM, ISA, BIMAX and 
Expander software for SAMBA with default settings for 
each algorithm [10,19,20]. 

Prelic's Synthetic Data 

We applied our algorithm to a synthetic gene expression
dataset. In the artificial datasets biclusters have been
created on the basis of two scenarios (data available at
http://www.tik.ee.ethz.ch/sop/bimax). In the first scenario,
non-overlapping biclusters with increasing noise
levels are generated. In the second scenario, biclusters
with increasing overlap but without noise are produced.
In both scenarios, biclusters with constant expression
values and biclusters following an additive model, where
the expression values vary over the conditions, are
investigated.

In order to assess the performance of different biclustering
algorithms, we used two measures from Prelic et
al. [10] and Hochreiter et al. [21], respectively. The measure
introduced by Prelic et al. calculates a similarity
based on the Jaccard index between the computed
biclusters and the implanted biclusters. This bicluster recovery
score measures the accuracy of the predicted biclusters;
however, it does not consider the number of
biclusters in both sets. Hochreiter et al. introduced a
consensus score by computing similarities between all
pairs of biclusters and then assigning the biclusters of
one set to biclusters of the other set. It penalizes different
numbers of biclusters by dividing the sum of similarities
by the number of biclusters in the larger set. A more
detailed description of the measures can be found in
Additional File 1.
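
As an illustration of the Jaccard-based measure, the following sketch implements one common reading of the Prelic et al. gene match score: for every bicluster in one set, take the best gene-set Jaccard index against the other set and average. The representation of a bicluster as a (genes, samples) pair and the function names are assumptions made for this example; the exact definitions used in the evaluation are those in Additional File 1.

```python
def jaccard(a, b):
    """Jaccard index of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gene_match_score(first, second):
    """Average over `first` of the best gene-set Jaccard match in `second`.

    Each bicluster is a (genes, samples) pair; only the gene dimension is used.
    """
    if not first or not second:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in second) for g1, _ in first]
    return sum(best) / len(first)

if __name__ == "__main__":
    implanted = [({1, 2, 3, 4}, {0, 1})]
    found = [({1, 2, 3}, {0, 1}), ({9}, {5})]
    # Recovery: how well the implanted biclusters are found.
    print(gene_match_score(implanted, found))
    # Relevance: how well the found biclusters match implanted ones.
    print(gene_match_score(found, implanted))
```

Because the score averages over the first argument only, swapping the arguments switches between recovery and relevance, which is why spurious extra biclusters lower relevance but not recovery.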

In Figures 1 and 2 the performance of BIMAX, ISA, 
SAMBA, DeBi, OPSM and QUBIC algorithms on the 
synthetic data is summarized based on Prelic et al. 
recovery score and Hochreiter et al. consensus score. 
The set of biclusters generated by these algorithms are 
filtered such that the remaining ones have a maximum 
overlap of 0.25. In the Prelic et al. paper, after the filter- 
ing process the largest 10 biclusters are chosen. Since 
the bicluster number is not known a priori, we have 
considered all the filtered biclusters. We did not evalu- 
ate xMotif and CC algorithms since they have been 
shown to perform badly in all the scenarios, mostly 
below 50% of recovery accuracy [10]. The CC and xMo- 
tif algorithms produce large biclusters containing genes 
that are not expressed. ISA and QUBIC give high Prelic 
et al. recovery score and Hochreiter et al. consensus 
score in all scenarios. SAMBA has a lower Hochreiter et 
al. consensus score compared to its Prelic et al. recovery 
score. The reason is that the Hochreiter et al. consensus
score takes into account both gene and condition
dimensions, and SAMBA is not very accurate in recovering
the biclusters in the condition dimension. In the absence
of noise with an increasing overlap degree, BIMAX has 
a high performance based on Prelic et al. and Hochreiter 
et al. scores. However, BIMAX estimates a large number
of biclusters with increasing noise level. The comparison
of the estimated number of biclusters given by the
algorithms with the true number of biclusters under all
the scenarios can be found in Figure S1 in Additional
File 1. In the absence of overlap with increasing noise
levels, DeBi is able to identify 99% of implanted biclusters
in both the additive and the constant model. A high degree
of overlap decreases the performance of DeBi because it
considers the overlapping part of the biclusters as a
separate bicluster. The DeBi biclustering results can be
found in Additional file 2.

Yeast Compendium 

We further applied our algorithm to the compendium of 
gene expression profiles derived from 300 different 
experimental perturbations of S. cerevisiae [13]. We discovered
192 biclusters in the yeast dataset containing
2025 genes and 192 conditions. As a binarization level
we used the fold change of 1.58 as recommended in the
original paper [13].

Figure 3 (a) illustrates the proportion of GO term and 
TFBS enriched biclusters for the six selected biclustering 
methods (ISA, OPSM, BIMAX, QUBIC, SAMBA and 
DeBi) at different levels of significance. DeBi performs
second best based on the biological validation measures;
BIMAX discovers a higher proportion of GO term and
TFBS enriched biclusters. All the biclusters and the enrichment
analysis can be found in Additional file 3.

In the analyzed yeast data, conditions are knocked-out 
genes. Since biclustering discovers subsets of genes and 
subsets of conditions we can also examine the biological 
significance of the clustered conditions. Similar to the
previous analysis, we measured GO term enrichment of
conditions in each discovered bicluster. DeBi is the second
best in discovering a high percentage of GO term
enriched biclusters.

In the discovered biclusters, the enriched gene functions
are related to the enriched sample functions.
In bicluster 83, genes are enriched in the 'conjugation' GO
term and conditions are enriched in 'regulation of biological
quality'. Moreover, there is an enrichment of the
TFBS of STE12, which is known to be involved in cell
cycle. Bicluster 50 consists of genes and samples that
are enriched in the 'ribosome biogenesis and assembly' GO
term. Bicluster 22 consists of genes and samples that
are enriched in the 'lipid metabolic process' GO term, and
additionally genes are enriched with TFBS of HAP1.
Bicluster 9 consists of down regulated genes and samples
that are enriched in the 'cell division' GO term, and
additionally genes are enriched with TFBS of STE12.

DLBCL Data 

We also evaluated our DeBi algorithm on the 'diffuse large
B-cell lymphoma' (DLBCL) dataset. The DLBCL dataset consists
of 661 genes and 180 samples. We applied the ISA,
OPSM, QUBIC, SAMBA and DeBi algorithms.

Figure 3 (b) illustrates the proportion of GO term and 
TFBS enriched biclusters for the five biclustering meth- 
ods at different levels of significance. DeBi discovers the 
highest proportion of GO term and TFBS enriched 
biclusters. The up regulated bicluster 16 and down regulated
bicluster 4 contain the sample classes identified
by [22]. Bicluster 16 is enriched with 'ribosome' and 'cell
cycle' GO Terms and Bicluster 4 is enriched with 'cell
cycle' and 'death' GO Terms. The protein interaction
networks of these two selected biclusters can be found in
Figures S2 and S3 in Additional File 1. Protein interaction
networks are generated using STRING [23]. All the 
biclusters and the enrichment analysis can be found in 
Additional file 4. 



Serin and Vingron Algorithms for Molecular Biology 2011, 6:18
http://www.almob.org/content/6/1/18






Figure 1 Bicluster recovery accuracy score on synthetic data. The synthetic data have been created based on two scenarios: (a) and (b) with
increasing noise level, constant and additive model respectively; (c) and (d) with increasing degree of overlap, constant and additive model respectively.
Panel titles: Effect of Noise: Relevance of BC's (Constant); Effect of Noise: Relevance of BC's (Additive); Regulatory Complexity: Relevance of BC's (Additive).



Human CMap Data 

We also evaluated our DeBi algorithm on the Connectivity
Map v0.2 (CMap) [14]. CMap is a reference collection
of gene expression profiles from human cells that
have been treated with a variety of drugs, comprising
6100 samples and 22283 genes. Figure 3 (c) summarizes
the results of DeBi and QUBIC. The proportion of GO
term and TFBS enriched biclusters is much
higher in DeBi compared to QUBIC.

The biclusters discovered by DeBi can be used to find 
drugs with a common mechanism of action and identify 
new therapeutics. Moreover, we can observe the effect
of drugs on different cell lines. Figure 4 shows parallel
coordinate plots of some of the identified biclusters. In
parallel coordinate plots, the profiles of the conditions
that are included in a bicluster are shown in black, the
other conditions in gray. This helps to visualize the
expression difference between the conditions in a bicluster
and the rest of the conditions. Bicluster
6 contains up regulated 'heat shock protein binding'
genes and 'heat shock protein inhibitors' such as geldanamycin,
alvespimycin, tanespimycin and monorden. Heat
shock proteins (Hsps) are overexpressed in a wide range
of human cancers and are involved in tumor cell
proliferation [24]. Additionally, genes in the bicluster are
enriched with 'P53 binding site', which is known to target
heat shock protein binding genes. Bicluster 11 contains
up regulated genes enriched with the 'cadmium ion
binding' GO Term and the calcium-binding protein inhibitor
calmidazolium. Bicluster 15 contains up regulated
genes enriched with the 'transcription corepressor activity'
GO Term. Cell lines in this bicluster are all breast cancer.
Bicluster 14 contains down regulated genes
enriched with the 'steroid hormone signalling' GO Term.

Figure 2 Bicluster consensus score on synthetic data. The synthetic data have been created based on two scenarios: (a) and (b) with increasing
noise level, constant and additive model respectively; (c) and (d) with increasing degree of overlap, constant and additive model respectively.
Panel titles: Effect of Noise: (Constant); Effect of Noise: (Additive).



Additionally, protein interaction networks of the 
selected biclusters are strikingly connected and they can 
be found in Figure S4, S5, S6 and S7 in Additional File 
1. All the biclusters and the enrichment analysis can be 
found in Additional file 5. 

Human ExpO Data 

We applied our DeBi algorithm and QUBIC to the Expression
Project for Oncology (expO) dataset (http://www.intgen.org/).
ExpO consists of gene expression profiles of 2158
human tumor samples coming from diverse tissues with
40223 transcripts.

Figure 3 Biological Significance of Yeast, DLBCL, CMap, ExpO Biclusters. GO and TFBS enrichment of yeast, DLBCL, CMap and ExpO biclusters.
Panel title: (d) GO and TFBS Enrichment of ExpO biclusters.

Figure 3 (d) shows that the proportion of GO term
and TFBS enriched biclusters is much higher in
DeBi compared to QUBIC, i.e. DeBi
performs better than QUBIC on the ExpO data. 70% of the
DeBi biclusters are enriched with GO Terms with a
p-value smaller than 0.05. Moreover, biclusters contain
tumor samples mostly from similar tissue types. Figure
S8 in Additional file 1 shows GO Term enrichment of
some of the biclusters. Bicluster 13 contains thyroid
tumor samples and genes enriched with 'protein-hormone
receptor activity'. Bicluster 3 contains prostate
tumor samples and genes enriched with 'tissue kallikrein
activity'. Bicluster 22 contains mostly pancreas and
colon samples and genes enriched with the 'pancreatic elastase
activity' GO Term. All the biclusters and the
enrichment analysis can be found in Additional file 6.

MSigDB Data 

Finally, we applied our algorithm to the manually curated
gene sets from the Molecular Signature Database
(MSigDB) C2 category. The C2 category of MSigDB consists
of 3272 gene sets, of which 2392 gene sets are chemical
and genetic perturbations and 880 gene sets are from
various pathway databases. The gene sets naturally define
a binary matrix where ones indicate the genes affected
under a certain perturbation/pathway. The binary matrix
contains 18205 genes and 3272 samples. This analysis helps
us to identify the pathways that are affected by chemical
and genetic perturbations. It has not been possible to run
QUBIC on this dataset because QUBIC requires a certain
amount of overlap between genes.

Figure 4 Example CMap biclusters identified using DeBi Algorithm. Parallel coordinate plots of some of the identified CMap biclusters using
the DeBi algorithm. In parallel coordinate plots, the profiles of the conditions that are included in a bicluster are shown in black, the other
conditions in gray.



Figure 5 illustrates all the biclusters using the BiVoc
algorithm [25]. The BiVoc algorithm rearranges rows and
conditions in order to represent the biclusters with the
minimum space. The output matrix of BiVoc may have
repeated rows and/or columns from the original matrix.
In Figure 5, the function of each bicluster is specified
based on GO Term enrichment. Bicluster 3 contains
the down-regulated gene set from Alzheimer patients
and the gene set from the proteasome pathway. It is known that
there is a significant decrease in proteasome activity in
Alzheimer patients [26]. Bicluster 3 also contains the
up-regulated gene set from pancreatic cancer patients.
In previous studies, high activity of the ubiquitin-proteasome
pathway in a pancreatic cancer cell line was
detected [27]. Bicluster 8 contains the up-regulated gene set
from liver cancer patients and the gene set from the G-protein
activation pathway. Dysfunction of G Protein-Coupled
Receptor signaling pathways is involved in certain
forms of cancer. All the biclusters and the enrichment
analysis can be found in Additional file 7.

Figure 5 MSigDB biclusters identified using DeBi Algorithm.
Legend: BC1, BC4: kinase activity; BC2, BC5, BC12: mitosis; BC3: proteolysis; BC6: lipid kinase activity; BC7, BC11, BC15: DNA replication;
BC8: G-protein coupled receptor protein signaling pathway; BC9: DNA-directed RNA polymerase II, holoenzyme; BC13: DNA repair;
BC14: cadmium ion binding.

Running Time 

The DeBi algorithm is capable of analyzing the yeast data (size
6100 x 300) in 6 minutes, the ExpO data (size 40223 x
2158) in 12 minutes, the MSigDB data (size 18205 x 3272)
in 11 minutes, the DLBCL data (size 610 x 180) in 11 seconds
and the CMap data (size 22283 x 6100) in 3 hours 45
minutes. The QUBIC algorithm analyzes the CMap data in
2 hours 55 minutes and the ExpO data in 3 hours 54 minutes.
The running time analysis was done on a 2.13
GHz Intel 2 Dual Core computer with 2GB memory.

Methods 

Given an expression matrix E with genes $G = \{g_1, g_2, g_3, \ldots, g_n\}$
and samples $S = \{s_1, s_2, s_3, \ldots, s_m\}$, a bicluster is
defined as $b = (G', S')$ where $G' \subseteq G$ is a subset of genes
and $S' \subseteq S$ is a subset of samples. DeBi identifies functionally
coherent biclusters $B = \{b_1, b_2, b_3, \ldots, b_l\}$ in three
steps. Below we describe each step in detail. An overview
of the DeBi algorithm is shown in Figure 6. The
DeBi algorithm is based on a well known data mining
approach called Maximal Frequent Item Set [28]. We
will refer to this as Maximal Frequent Gene Set, as
given by our problem definition. The pseudocode of the
algorithm is in Additional file 1.



Preliminaries 

The input gene expression data is binarized according to
either up or down regulation. Let $E^u$ and $E^d$ denote the
up and down regulation binary matrices, respectively.
Then the entries $e^u_{ij}$ of $E^u$ are defined as follows:

$$e^u_{ij} = \begin{cases} 1 & \text{if gene } i \text{ is } c\text{-fold up regulated in sample } j \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

and the entries $e^d_{ij}$ of $E^d$ are defined analogously
with a c-fold down-regulation cut-off. The fold change
cut-off c will typically be set to 2.
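
The binarization can be sketched as follows. The sketch assumes that the input matrix holds log2 expression ratios relative to some reference, so that a c-fold change corresponds to a threshold of log2(c); the text does not state the reference or whether the cut-off is inclusive, so both are assumptions, and the function name is hypothetical.

```python
import math

def binarize(log2_ratios, c=2.0):
    """Return (E_up, E_down) binary matrices for a c-fold cut-off.

    log2_ratios -- list of rows (genes) of log2 expression ratios (assumed input)
    c           -- fold-change cut-off; an entry counts as regulated when the
                   ratio reaches the cut-off (inclusive comparison is an assumption)
    """
    threshold = math.log2(c)
    e_up = [[1 if x >= threshold else 0 for x in row] for row in log2_ratios]
    e_down = [[1 if x <= -threshold else 0 for x in row] for row in log2_ratios]
    return e_up, e_down

if __name__ == "__main__":
    data = [[1.0, -1.5, 0.2],
            [2.3, 0.0, -1.0]]
    up, down = binarize(data, c=2.0)
    print(up)    # up regulation matrix
    print(down)  # down regulation matrix
```

Keeping the up and down matrices separate mirrors the text: the remaining steps are run once on each matrix, so a bicluster never mixes up and down regulated genes.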

Finding seed biclusters by Maximal Frequent Gene Set 
Algorithm 

The DeBi algorithm identifies the seed gene sets by
iteratively applying the maximal frequent gene set algorithm.
We first define the term support, which we will
later use in the algorithm. The support of the gene $g_i$,
$i = 1, \ldots, n$, is defined as follows:

$$\mathrm{supp}(g_i) = \frac{1}{m} \sum_{j=1}^{m} e_{ij} \qquad (2)$$

In other words, the support is the proportion of samples
for which the gene-vector $e_{i\cdot}$ is 1. This is further
extended to sets of genes. Let $G_v = \{g_1, \ldots, g_k\}$ be the
v-th gene-set. For a set of gene-vectors we define their
phenotype vector $C_v$ as their element-wise logical AND:

$$C_v = \wedge(e_{1\cdot}, \ldots, e_{k\cdot}) \qquad (3)$$

The support of the gene set is then defined as the
fraction of samples for which the phenotype vector is 1.

A gene set $G'_v$ is $(c_1, c_2)$-frequent iff its support
$\mathrm{supp}(G'_v)$ is larger than $c_1$ and the cardinality $|G'_v|$ is above $c_2$.
When $c_1$ and $c_2$ are not in focus, we will simply speak of
a frequent gene set. A gene set is maximally frequent iff
it is frequent and no superset of it is frequent.

Figure 6 Illustration of DeBi algorithm. The algorithm is run on two different binarized datasets. One is the binarized data based on up
regulation and the other is the binarized data based on down regulation. In Step 1, seed biclusters are identified within each support value going
from high to low. For the binarized data based on up regulation, in the 1st iteration, the red gene set with support value 10/20 is detected and
excluded from the search space. Similarly, in the second and third iterations the yellow and blue clusters with support values 6/20 and
4/20, respectively, are found. In Step 2, seed gene sets are improved based on the genes' association strength. Gene 15 is added to the red bicluster because the
p-value returned by the Fisher exact test is smaller than $\alpha$, and gene 13 is deleted because the p-value returned by the Fisher exact test is
higher than $\alpha$. None of the discovered biclusters have an overlap of the gene x sample area of more than 50%.

The simplest method for detecting maximally frequent 
gene sets is a brute force approach in which each possible
subset of $G = \{g_1, g_2, g_3, \ldots, g_n\}$ is a candidate frequent
set. To find the frequent sets we count the support of 
each candidate set. The MAFIA algorithm is an efficient 
implementation for finding maximally frequent sets with 
support above a given threshold [28]. The search strat- 
egy of MAFIA uses a depth-first traversal of the gene 
set lattice with effective pruning techniques. It avoids 
exhaustive enumeration of all candidate gene sets by a 
monotonicity principle. The monotonicity principle 
states that every subset of a frequent itemset is frequent. 
It prunes the candidates which have an infrequent sub- 
pattern using this property. 
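
To make the definitions of support, frequency and maximality concrete, here is a small sketch that enumerates maximal frequent gene sets on a toy binary matrix. It is not MAFIA: it uses a simple level-wise (breadth-first) search instead of MAFIA's depth-first traversal, and it relies only on the monotonicity principle for pruning. The function names, the toy matrix and the inclusive support comparison are assumptions of this example.

```python
from itertools import combinations

def support(gene_set, matrix):
    """Fraction of samples in which every gene of gene_set is 1 (logical AND)."""
    m = len(matrix[0])
    return sum(all(matrix[g][j] for g in gene_set) for j in range(m)) / m

def maximal_frequent_gene_sets(matrix, min_support, min_size=1):
    """Level-wise search with monotonicity pruning (illustrative, not MAFIA)."""
    n = len(matrix)
    level = [frozenset([g]) for g in range(n) if support([g], matrix) >= min_support]
    frequent = set(level)
    while level:
        # Join sets of size k that differ in one gene to get candidates of size k+1.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Monotonicity: a candidate is only tested if all its subsets are frequent.
        level = [c for c in candidates
                 if all(c - {g} in frequent for g in c)
                 and support(c, matrix) >= min_support]
        frequent.update(level)
    maximal = [s for s in frequent if not any(s < t for t in frequent)]
    return sorted(sorted(s) for s in maximal if len(s) >= min_size)

if __name__ == "__main__":
    # Rows are genes, columns are samples.
    toy = [[1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [0, 0, 0, 1, 1]]
    print(maximal_frequent_gene_sets(toy, min_support=0.4))
```

The level-wise search is exponential in the worst case and is only meant for small examples; the point of MAFIA in the text is precisely to avoid this cost on real data.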

In the first step of the DeBi algorithm, MAFIA is
iteratively applied to the binary matrix, successively
reducing the support threshold. Initially, MAFIA is
applied to the full binary matrix $E^u$ ($E^d$) with support
value $(c_1)_0$ equal to the support value of the gene with the
highest support. In iteration k, MAFIA is applied with a
support value threshold of $(c_1)_k = (c_1)_{k-1} - \frac{1}{m}$. The
identified maximally frequent sets are added to the set
of seed gene sets B and the genes in B are deleted from
the binary matrix $E^u$ ($E^d$). In each iteration MAFIA is
applied to the modified matrix $E^u$ ($E^d$). The process is
repeated until a user defined minimum support parameter
is reached.
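
The iterative seed search can be sketched as a loop that lowers the support threshold by one sample per iteration and removes the genes of every discovered set from the search space. In this sketch supports are kept as integer sample counts to avoid floating point drift, and an exhaustive search stands in for MAFIA; both choices, as well as all function and parameter names, are assumptions made for illustration.

```python
from itertools import combinations

def count_support(genes, matrix):
    """Number of samples in which all the given genes are 1."""
    return sum(all(matrix[g][j] for g in genes) for j in range(len(matrix[0])))

def naive_maximal_sets(genes, matrix, min_count, min_size):
    """Stand-in for MAFIA: exhaustive search, only suitable for tiny inputs."""
    frequent = [frozenset(c)
                for k in range(min_size, len(genes) + 1)
                for c in combinations(genes, k)
                if count_support(c, matrix) >= min_count]
    return [s for s in frequent if not any(s < t for t in frequent)]

def find_seed_gene_sets(matrix, min_count_floor, min_size=2):
    """Return a list of (gene set, support count) seeds, highest support first."""
    remaining = list(range(len(matrix)))
    seeds = []
    # Start at the support of the best-supported single gene.
    count = max(count_support([g], matrix) for g in remaining)
    while count >= min_count_floor and len(remaining) >= min_size:
        found = naive_maximal_sets(remaining, matrix, count, min_size)
        seeds.extend((sorted(s), count) for s in found)
        used = set().union(*found) if found else set()
        # Exclude discovered genes from the search space.
        remaining = [g for g in remaining if g not in used]
        count -= 1  # corresponds to lowering the support by 1/m
    return seeds

if __name__ == "__main__":
    toy = [[1, 1, 1, 1, 0, 0],
           [1, 1, 1, 1, 0, 0],
           [0, 0, 0, 0, 1, 1],
           [0, 0, 0, 0, 1, 1],
           [1, 0, 0, 0, 0, 1]]
    print(find_seed_gene_sets(toy, min_count_floor=2))
```

Removing the genes of each seed before lowering the threshold is what keeps the high-support sets from reappearing as subsets of larger, lower-support sets in later iterations.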

Extending and filtering the biclusters 

In the second step of DeBi, the identified seed gene sets
$G' = \{G'_1, G'_2, \ldots, G'_l\}$ are extended using a local search.
For each bicluster $b_v = (G'_v, S'_v)$, $v = 1, \ldots, l$, we have the
binary phenotype vector $C_v = \wedge(e_{1\cdot}, \ldots, e_{k\cdot}) = (C_{v1}, \ldots, C_{vm})$.
The entries of $C_v$ indicate the indices of the bicluster
samples: if $C_{vj} = 1$ then $s_j \in S'_v$, $j = 1, \ldots, m$, i.e. the
sample $s_j$ belongs to the bicluster $b_v$. The gene $g_i$,
$i = 1, \ldots, n$, is an element of gene set $G'_v$ if $e_{i\cdot}$ is associated
with $C_v$. We evaluate the association strength between
the phenotype vector of a bicluster and another gene
using Fisher's exact test on a 2 x 2 contingency table.
The cells of the contingency table count how often the
four possibilities of the phenotype vector containing a 1
or a 0 and the gene-vector containing a 1 or a 0 occur.
The Fisher's exact test then tests for independence in
the contingency table and thus among the two vectors.
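
The association test can be illustrated with a short sketch that builds the 2 x 2 contingency table from a gene-vector and a phenotype vector and computes a Fisher exact p-value from the hypergeometric distribution. The text does not say whether a one-sided or two-sided test is used; the one-sided version below (testing for positive association) is an assumption, and the function names are hypothetical.

```python
from math import comb

def contingency_table(gene_vector, phenotype):
    """Counts (n11, n10, n01, n00) of gene value versus phenotype value."""
    n11 = sum(1 for g, c in zip(gene_vector, phenotype) if g and c)
    n10 = sum(1 for g, c in zip(gene_vector, phenotype) if g and not c)
    n01 = sum(1 for g, c in zip(gene_vector, phenotype) if not g and c)
    n00 = len(phenotype) - n11 - n10 - n01
    return n11, n10, n01, n00

def fisher_exact_greater(n11, n10, n01, n00):
    """One-sided p-value P(X >= n11) with all table margins held fixed."""
    row1 = n11 + n10            # number of 1s in the gene-vector
    col1 = n11 + n01            # number of samples in the bicluster
    total = n11 + n10 + n01 + n00
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(n11, min(row1, col1) + 1))
    return tail / comb(total, col1)

if __name__ == "__main__":
    phenotype = [1, 1, 1, 0, 0, 0]
    gene = [1, 0, 1, 0, 1, 0]
    print(fisher_exact_greater(*contingency_table(gene, phenotype)))
```

In the extension step a gene would be kept in, or added to, a bicluster when this p-value is below the chosen threshold.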

A gene $g_i$, $i = 1, \ldots, n$, is added to the gene set $G'_v$ if the
p-value $p_{g_i}$ returned by the Fisher exact test is lower than
the parameter $\alpha$, and it is deleted from $b_v$ if the p-value
is higher than $\alpha$. For this procedure the association probability
$p_{g_i}$ with the bicluster needs to be calculated for
each gene. However, we reduce the computational effort
using the monotonicity property of the hypergeometric
distribution. We precompute cut-off values on the contingency
table entries that yield a p-value just higher
than $\alpha$. Let $o_{1,\mathrm{IN}}$ and $o_{1,\mathrm{OUT}}$ denote the number of 1's
a gene-vector has in the bicluster samples and the number
of 1's a gene-vector has outside the bicluster samples,
respectively. We find the minimal $o_{1,\mathrm{IN}}$ and
maximal $o_{1,\mathrm{OUT}}$ at this border. Then, we apply Fisher's
exact test only to those genes which have $o_{1,\mathrm{IN}} > \min o_{1,\mathrm{IN}}$
and $o_{1,\mathrm{OUT}} < \max o_{1,\mathrm{OUT}}$.

In the last step we turn to the sometimes very complicated
overlap structure among biclusters. The goal is to
filter the set of biclusters such that the remaining ones
are large and overlap only little. The size of a bicluster is
defined as the number of genes times the number of
samples in the bicluster, $|G'_v| \times |S'_v|$. Two biclusters overlap
when they share common samples and genes. The
size of the overlap is the product of the number of common
samples and common genes. To filter out biclusters
that are largely contained in a bigger bicluster, we start
with the largest bicluster and compare it to the other
biclusters. Those biclusters for which the overlap with the
largest one exceeds L% (typically 50%) of the size of the
smaller one are deleted. This is then repeated starting
with the remaining second-largest bicluster and so on.
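
The filtering step can be sketched as a single greedy pass over the biclusters in order of decreasing size. A bicluster is represented here as a (gene set, sample set) pair; that representation and the function name are assumptions of the example.

```python
def filter_overlapping(biclusters, max_overlap=0.5):
    """Keep large biclusters and drop smaller ones that they mostly cover.

    biclusters  -- list of (genes, samples) pairs, each a set
    max_overlap -- allowed overlap as a fraction of the smaller bicluster's size
    """
    def size(b):
        return len(b[0]) * len(b[1])

    kept = []
    for cand in sorted(biclusters, key=size, reverse=True):
        genes, samples = cand
        # Overlap size = common genes x common samples with a kept, larger bicluster.
        covered = any(
            len(genes & kg) * len(samples & ks) > max_overlap * size(cand)
            for kg, ks in kept
        )
        if not covered:
            kept.append(cand)
    return kept

if __name__ == "__main__":
    b1 = ({1, 2, 3, 4}, {1, 2, 3, 4})
    b2 = ({1, 2, 3}, {1, 2, 3})      # fully inside b1
    b3 = ({4, 5, 6}, {4, 5})         # shares one gene and one sample with b1
    print(filter_overlapping([b2, b3, b1]))
```

Processing candidates in size order gives the same result as repeatedly restarting from the largest remaining bicluster, because a candidate is only ever compared against retained biclusters that are at least as large.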

Choosing the optimum alpha parameter 

To formulate an optimality criterion for $\alpha$ one requires
an inherent measure of the quality of a set of biclusters.
To this end, for a bicluster v, we define its score $I_v$ as
the negative sum of the log p-values of the included
genes, where the individual $p_g$ is the p-value from the
Fisher exact test:

$$I_v = - \sum_{g \in G'_v} \log(p_g) \qquad (4)$$

However, this bicluster score $I_v$ depends on the size
(number of genes x number of conditions) of the bicluster
and in order to make it comparable between biclusters
one needs to correct for the size. We compute the
expected bicluster score through a randomization procedure.
A large number, say 500, of random phenotype vectors
having the same number of 1s as the bicluster has
conditions is generated. For these random phenotype
vectors a Fisher exact test p-value with respect to each
gene in the bicluster is computed. One obtains a random
$I_v$ score by adding log p-values over the genes of
the bicluster as in equation (4). The mean of these random bicluster
scores is the desired estimator. Finally, a normalized $NI_v$
score is defined by dividing $I_v$ by this estimated mean,
and the total biclustering score CS is defined as the sum
of the normalized $NI_v$ scores of all discovered biclusters,
$CS = \sum_{v=1}^{l} NI_v$. This score serves to distinguish
between different choices of $\alpha$. The program is run
under $\alpha = \{10^{-2}, 10^{-3}, \ldots, 10^{-100}\}$ and we choose the
$\alpha$ that maximizes CS.
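
The normalization by randomization can be sketched as follows. The sketch computes the score of equation (4) for a bicluster and divides it by the mean score obtained with shuffled phenotype vectors, which preserve the number of 1s. The one-sided Fisher test, the use of shuffling to generate random phenotype vectors, the fixed random seed and all function names are assumptions of this example.

```python
import math
import random
from math import comb

def fisher_p(gene_vector, phenotype):
    """One-sided Fisher exact p-value for positive association (assumed variant)."""
    total = len(phenotype)
    row1 = sum(gene_vector)
    col1 = sum(phenotype)
    n11 = sum(g & c for g, c in zip(gene_vector, phenotype))
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(n11, min(row1, col1) + 1))
    return tail / comb(total, col1)

def bicluster_score(gene_vectors, phenotype):
    """I_v: negative sum of log p-values over the genes of the bicluster."""
    return -sum(math.log(fisher_p(g, phenotype)) for g in gene_vectors)

def normalized_score(gene_vectors, phenotype, n_random=500, seed=0):
    """NI_v: observed score divided by the mean score under random phenotypes."""
    rng = random.Random(seed)
    observed = bicluster_score(gene_vectors, phenotype)
    random_scores = []
    for _ in range(n_random):
        shuffled = list(phenotype)
        rng.shuffle(shuffled)  # keeps the number of 1s unchanged
        random_scores.append(bicluster_score(gene_vectors, shuffled))
    expected = sum(random_scores) / n_random
    return observed / expected if expected > 0 else float("inf")

if __name__ == "__main__":
    phenotype = [1] * 5 + [0] * 15
    genes = [list(phenotype) for _ in range(3)]  # three perfectly associated genes
    print(normalized_score(genes, phenotype, n_random=200))
```

Under these assumptions, the total score CS for one value of the threshold would be the sum of `normalized_score` over all biclusters found with that threshold, and the threshold with the largest CS would be selected.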

Discussion 

We have proposed a novel fast biclustering algorithm 
especially for analyzing large datasets. Our algorithm 
aims to find biclusters where each gene in a bicluster 
should be highly or lowly expressed over all the biclus- 
ter samples compared to the rest of the samples. Unlike 
other algorithms, it is not required to define the number 
of biclusters a priori. We have compared our method 
with other biclustering algorithms using synthetic data 
and biological data. We have shown that the DeBi algo- 
rithm provides biologically significant biclusters using 
GO term and TFBS enrichment. We have also shown
the computational efficiency of our algorithm, which makes
it a useful and powerful tool for analyzing
large datasets.

In spite of efforts by many authors, comparing the 
performance of biclustering algorithms is still a chal- 
lenge [29]. Smaller biclusters have a higher chance to 
yield a coherent GO annotation, while larger biclusters 
would, of course, be more interesting to observe. Our
$\alpha$ threshold influences this behavior. The optimized
$\alpha$ threshold is smaller for a larger number of samples,
which limits the number of genes that get accepted into
a bicluster.

The binarization of the input data in order to obtain a 
boolean matrix is another key decision in our approach. 
In this we go along with many other authors and we 
think that it helps in applying biclustering to gene 
expression data coming from different labs or platforms. 
The hope is that our method will further contribute to 
establishing biclustering as a general purpose tool for 
data analysis in functional genomics. 

Implementation 

The DeBi code is written in the C++ programming language
for the UNIX environment. The MAFIA algorithm C++
code is used for calculating the maximally frequent item
sets. The DeBi algorithm is freely available at
http://www.molgen.mpg.de/~serin/debi/main.html.

Additional material 



Additional file 1: Description of selected biclustering algorithms, 
description of MAFIA algorithm, protein protein interaction 
networks. 

Additional file 2: DeBi results on synthetic data. 



Additional file 3: DeBi, BIMAX, ISA, OPSM, SAMBA and QUBIC 
biclustering results and GO term, TFBS enrichment analysis of the 
genes and conditions in biclusters on yeast data. 

Additional file 4: DeBi, ISA, OPSM, SAMBA, QUBIC biclustering 
results and GO term, TFBS enrichment analysis of the genes on 
DLBCL data 

Additional file 5: DeBi and QUBIC biclustering results and GO term 
and TFBS enrichment analysis of the biclusters on CMap data. 

Additional file 6: DeBi and QUBIC biclustering results and GO term 
and TFBS enrichment analysis of the biclusters on ExpO data. 

Additional file 7: DeBi biclustering results and GO term and TFBS 
enrichment analysis of the biclusters on MSigDB data. 